Skip to content

Conversation

@enool
Copy link

@enool enool commented Aug 11, 2020

On conditions where callback supplied to resource.Retry() takes a long
time to execute, a timeout error is returned to caller while the
callback is still running.

To remind the reader, WaitForState creates a separate goroutine for
calling Refresh() callback from StateChangeConf. The execution of
Refresh() goroutine is monitored in the main thread with a timeout.

However, while timeout does issue a cancellation for the Refresh()
goroutine, it does not interrupt the already started Refresh() calls.
In addition, cancellation mechanism itself gave up after grace period
expired on timeouts. This potentially left the Refresh() callback
running in the background while returning TimeoutError to the caller.

When timeout error is returned, plugin(s) incorrectly assume that it is
safe to rerun the Refresh() content. In reality, the state after
timeout is completely unknown.

Annotated example from AWS plugin:

var createResp *iam.CreateRoleOutput
err := resource.Retry(30*time.Second, func() *resource.RetryError {
    var err error
    createResp, err = iamconn.CreateRole(request)                <-- Has internally a retry loop, can block more then 30 seconds
    // IAM users (referenced in Principal field of assume policy)
    // can take ~30 seconds to propagate in AWS
    if isAWSErr(err, "MalformedPolicyDocument", "Invalid principal in policy") {
        return resource.RetryableError(err)
    }
    return resource.NonRetryableError(err)
})
if isResourceTimeoutError(err) {                                <-- Goroutine started in Retry (WaitForState) can still be running
    createResp, err = iamconn.CreateRole(request)               <-- Issues another blocking CreateRole
}

The fix is simply to remove the grace period and the matching select
timeout. This makes WaitForState() block until Refresh() goroutine
returns something to result channel.

On conditions where callback supplied to resource.Retry() takes a long
time to execute, a timeout error is returned to caller while the
callback is still running.

To remind the reader, WaitForState creates a separate goroutine for
calling Refresh() callback from StateChangeConf. The execution of
Refresh() goroutine is monitored in the main thread with a timeout.

However, while timeout does issue a cancellation for the Refresh()
goroutine, it does not interrupt the already started Refresh() calls.
In addition, cancellation mechanism itself gave up after grace period
expired on timeouts. This potentially left the Refresh() callback
running in the background while returning TimeoutError to the caller.

When timeout error is returned, plugin(s) incorrectly assume that it is
safe to rerun the Refresh() content. In reality, the state after
timeout is completely unknown.

Annotated example from AWS plugin:

    var createResp *iam.CreateRoleOutput
    err := resource.Retry(30*time.Second, func() *resource.RetryError {
        var err error
        createResp, err = iamconn.CreateRole(request)                <-- Has internally a retry loop, can block more then 30 seconds
        // IAM users (referenced in Principal field of assume policy)
        // can take ~30 seconds to propagate in AWS
        if isAWSErr(err, "MalformedPolicyDocument", "Invalid principal in policy") {
            return resource.RetryableError(err)
        }
        return resource.NonRetryableError(err)
    })
    if isResourceTimeoutError(err) {                                <-- Goroutine started in Retry (WaitForState) can still be running
        createResp, err = iamconn.CreateRole(request)               <-- Issues another blocking CreateRole
    }

The fix is simply to remove the grace period and the matching select
timeout. This makes WaitForState() block until Refresh() goroutine
returns something to result channel.
@hashicorp-cla
Copy link

CLA assistant check

Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement

Learn more about why HashiCorp requires a CLA and what the CLA includes

Have you signed the CLA already but the status is still pending? Recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants